Part 1.1
Four \(x-y\) datasets which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.), yet are quite different.
anscombe
A data frame with 11 observations on 8 variables.
x1 == x2 == x3: the integers 4:14, specially arranged

x4: values 8 and 19

y1, y2, y3, y4: numbers in (3, 12.5) with mean 7.5 and standard deviation 2.03

Tufte, Edward R. (1989). The Visual Display of Quantitative Information, 13–14. Graphics Press.

Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| Min. : 4.0 | Min. : 4.0 | Min. : 4.0 | Min. : 8 | Min. : 4.260 | Min. :3.100 | Min. : 5.39 | Min. : 5.250 |
| 1st Qu.: 6.5 | 1st Qu.: 6.5 | 1st Qu.: 6.5 | 1st Qu.: 8 | 1st Qu.: 6.315 | 1st Qu.:6.695 | 1st Qu.: 6.25 | 1st Qu.: 6.170 |
| Median : 9.0 | Median : 9.0 | Median : 9.0 | Median : 8 | Median : 7.580 | Median :8.140 | Median : 7.11 | Median : 7.040 |
| Mean : 9.0 | Mean : 9.0 | Mean : 9.0 | Mean : 9 | Mean : 7.501 | Mean :7.501 | Mean : 7.50 | Mean : 7.501 |
| 3rd Qu.:11.5 | 3rd Qu.:11.5 | 3rd Qu.:11.5 | 3rd Qu.: 8 | 3rd Qu.: 8.570 | 3rd Qu.:8.950 | 3rd Qu.: 7.98 | 3rd Qu.: 8.190 |
| Max. :14.0 | Max. :14.0 | Max. :14.0 | Max. :19 | Max. :10.840 | Max. :9.260 | Max. :12.74 | Max. :12.500 |
The summary function above gives the summary statistics
for each of the columns (x1, x2, x3, x4, y1, y2, y3, y4) in
the anscombe dataset.
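The code behind this table is not shown, so the following is only a minimal sketch of how it might be produced, assuming the built-in `anscombe` data frame. The names `col_means` and `col_sds` are introduced here for illustration. They also check the quoted mean (7.5) and standard deviation (2.03) of the y columns.

```r
# anscombe ships with base R (package "datasets"), so nothing needs loading.
summary(anscombe)

# Column means and standard deviations, rounded for comparison with the
# documented values (y mean 7.5, y standard deviation 2.03).
col_means <- round(sapply(anscombe, mean), 2)
col_sds   <- round(sapply(anscombe, sd), 2)
print(col_means)
print(col_sds)
```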
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x1 | 1 | 27.51000 | 27.510001 | 17.98994 | 0.0021696 |
| Residuals | 9 | 13.76269 | 1.529188 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x2 | 1 | 27.50000 | 27.500000 | 17.96565 | 0.0021788 |
| Residuals | 9 | 13.77629 | 1.530699 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x3 | 1 | 27.47001 | 27.470008 | 17.97228 | 0.0021763 |
| Residuals | 9 | 13.75619 | 1.528466 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x4 | 1 | 27.49000 | 27.490001 | 18.00329 | 0.0021646 |
| Residuals | 9 | 13.74249 | 1.526943 | NA | NA |
Above are the ANOVA tables for the four linear regression models
fitted in the for-loop, each one regressing y on x using the 11
observations in one of the four “datasets”. After the loop fits a simple
linear regression model to each dataset, we print each analysis of
variance table using the kable() function, whose output is not printed
automatically inside a for-loop, and we see that the ANOVA tables for
the four models are extremely similar.
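The loop itself is not shown. One way it could look is sketched below, assuming base R only. The names `models` and `anova_tables` are hypothetical, and `print()` stands in for `kable()` so the sketch does not depend on knitr.

```r
# Fit y_i ~ x_i for each of the four Anscombe datasets.
models <- list()
for (i in 1:4) {
  f <- as.formula(paste0("y", i, " ~ x", i))
  models[[i]] <- lm(f, data = anscombe)
}
names(models) <- paste0("lm", 1:4)

# One analysis of variance table per fitted model.
anova_tables <- lapply(models, anova)

# Explicit print() is needed inside a loop. In an R Markdown chunk with
# results = "asis", print(knitr::kable(tab)) could be used instead
# (assumes knitr is installed).
for (nm in names(anova_tables)) {
  cat("\n", nm, "\n", sep = "")
  print(anova_tables[[nm]])
}
```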
| | lm1 | lm2 | lm3 | lm4 |
|---|---|---|---|---|
| (Intercept) | 3.0000909 | 3.000909 | 3.0024545 | 3.0017273 |
| x1 | 0.5000909 | 0.500000 | 0.4997273 | 0.4999091 |
The numerical properties of the simple linear regression models for the four datasets are extremely similar: the intercepts, slopes, standard errors, t-statistics and p-values are all almost identical.
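As a hedged sketch of how these quantities could be collected side by side (the names `fits`, `coef_table` and `coef_details` are assumptions, not taken from the original code):

```r
# Refit the four models so this block stands on its own.
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})
names(fits) <- paste0("lm", 1:4)

# Intercepts and slopes, one column per model.
coef_table <- sapply(fits, coef)
print(coef_table)

# Full coefficient matrix per model:
# estimate, standard error, t value and p-value.
coef_details <- lapply(fits, function(fit) summary(fit)$coefficients)
print(coef_details$lm1)
```

Comparing the entries of `coef_details` across the four models is one way to see how close the standard errors, t-statistics and p-values are.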
The plots above tell a very different story than the ANOVA tables and
the numerical properties of the linear regression models. Using these
visualizations, we see that the trends, distributions, and shapes of the
data are far from identical: in fact, they are each very distinct from
one another, and significantly different conclusions should be drawn from
each dataset. If one had not visualized these data and had simply relied on
the numerical properties generated first, deeply incorrect
assumptions might have been made.
In x1, we see a set of data that looks to be a decent
fit for the linear model. x2 tells a different story: we
see a quadratic trend rather than a linear one. In x3, the
data do appear to be linear, in fact with significantly less variance
than in x1, but a single positive outlier distorts the slope
of the model. Finally, x4 appears to represent categorical
data, again with a single outlier.
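A minimal sketch of how the four panels might be drawn with base graphics, followed by two quick numerical checks of the observations above. The axis limits, colors and the names `quad2`, `r2_linear`, `r2_quad` and `x4_counts` are assumptions chosen for illustration.

```r
# Draw the four scatterplots in a 2 x 2 grid, each with its fitted line.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i),
       xlab = paste0("x", i), ylab = paste0("y", i),
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "blue")
}
par(op)  # restore the previous graphics settings

# Dataset 2: compare a straight-line fit with a quadratic fit.
r2_linear <- summary(lm(y2 ~ x2, data = anscombe))$r.squared
quad2     <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
r2_quad   <- summary(quad2)$r.squared
print(c(linear = r2_linear, quadratic = r2_quad))

# Dataset 4: x4 takes only two distinct values.
x4_counts <- table(anscombe$x4)
print(x4_counts)
```

Using the same axis limits in every panel keeps the four shapes directly comparable.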
Part 1.2
‘Monstrous Costs’ by Nigel Holmes (1982)
gapminder interactive plot from the introduction.
Part 2.1
Part 2.2
From the year 1957, I identified the red marker that lies at roughly
40 on the lifeExp axis, midway between “1e+03” and
“1e+04” on the gdpPercap axis. From the perspective of
“preattentive pop-out”, I think this marker caught my
attention for two reasons: its color and its position on
the plot relative to the other markers. To me, the red strikes a
stronger contrast against the grey background of the plot than the other
colors do, and it also contrasts strongly with the other colors
themselves. Additionally, whereas many of the markers
are moderately or tightly clustered together, this marker is visually
separated from many of the others: not only does it
not lie in a tight cluster, but it sits on the outside of the
general grouping of markers. As the mind likes to place structure on
objects based on their proximity to each other, this “outlier” nature
makes it stand out on a quick visual scan.
I identified the country as Angola. I was not surprised to find that it had one of the lowest life expectancies, but I was surprised that its life expectancy is so low given that it seems to fall roughly around the 60th percentile of all countries in terms of GDP per capita. Far from affirming an intuition, this pairing of information (relatively high GDP per capita but one of the lowest life expectancies) was unexpected, and it is interesting to consider why that might be so.
Upon playing the animated sequence, the marker that immediately caught my attention was the left-most large green marker, which I later identified as belonging to China. This marker caught my attention on a pre-attentive level due to its size. It is one of two markers that are significantly larger than all the others, and it is larger still than the next largest; the size of an object is known to be a factor in pre-attentive search. As the animation begins, the marker also displays an eye-catching vertical bouncing motion that further drew my attention. I was unsurprised to find out that the marker represented China, since I knew that marker size was correlated with population size. In general, the animation accorded with my expectations and what I know about the country. However, I did learn something interesting from the way the marker had a single steep vertical drop in life expectancy around the year 1960 before continuing a slow and steady increase in both GDP per capita and life expectancy. This spurred me to research the history of China around that time, and I found that it marked the end of the period known as the Great Leap Forward, in which Chairman Mao Zedong led a campaign to transform China from an agrarian society to an industrialized one, which led to a massive famine and the deaths of tens of millions of people.
Part 2.3
Plots of China and India in 1952 demonstrate the gestalt rule of similarity: things that look alike seem to be related. China and India are both represented by markers that are similarly very large, both significantly larger than any other marker, and therefore seem to be related.
Plots of Equatorial Guinea and Botswana in 2007 demonstrate the rule of proximity: things that are spatially near to one another seem to be related. These two countries are located extremely close together on the plot, and therefore appear to be related to each other in terms of GDP per capita and life expectancy.
Plots of Ireland and the United States in 2007 demonstrate the rule of figure and ground: visual elements are taken to be either in the foreground or in the background. Here, it appears that Ireland is in the foreground, that it is somehow ‘on top’ of the United States, which appears to be ‘behind’ it. In the context of this particular plot, this effect holds no statistical meaning.
Plots of all countries in 1952 and in 2007 demonstrate the concept of common fate: we see that most of the markers have moved in a positive direction along both axes, therefore much of the cloud moves in a ‘southwest’ to ‘northeast’ direction across the plot. The rule of common fate says that elements sharing a direction of movement are perceived as a unit. Here, this rule has the effect of suggesting that all countries are moving in a similar direction and following a similar trend of progress as a general unit, when this may not be the case.
Part 2.4
One inequality that is apparent when watching the animated plot is that some countries make significant gains in GDP per capita and/or life expectancy, while other countries move very little in either or both directions. One way to highlight this inequality in growth (along the axes for both variables) would be to add a feature to the final static plot for the year 2007 that quantifies how much each country changed from 1952 to 2007 in both variables. This statistic could be added to the tooltip that already exists for each country, or it could be shown separately by color coding countries by the amount of progress made: high levels of change, medium amounts of change, and little or no change, with potentially an additional indicator for countries that moved in the negative direction. This could be a helpful way to visualize and identify this kind of “inequality of progress”, since it is difficult to observe and track while studying the animation, and certainly on a large scale, if we want to identify any patterns in which countries made the largest and smallest gains in GDP per capita and life expectancy. Particularly if we created more than three levels of progress to indicate with color, such as 10 or 20 bins (or even a continuous color gradient rather than discrete chunks), it would begin to give us a general sense of what the distribution of progress statistics might look like.
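The proposed change statistic could be computed along the following lines. This is only a sketch under stated assumptions: the function `progress_change()` and its output column names are hypothetical, the input is assumed to have `country`, `year`, `lifeExp` and `gdpPercap` columns as in the gapminder data, and binning is done on life expectancy change alone for simplicity.

```r
# Hypothetical helper: per-country change between two years, plus a
# discrete progress bin that could drive a color scale or a tooltip.
progress_change <- function(df, from = 1952, to = 2007, n_bins = 3) {
  cols  <- c("country", "lifeExp", "gdpPercap")
  start <- df[df$year == from, cols]
  end   <- df[df$year == to, cols]
  out   <- merge(start, end, by = "country", suffixes = c("_from", "_to"))

  # Absolute change in life expectancy, relative change in GDP per capita.
  out$lifeExp_change <- out$lifeExp_to - out$lifeExp_from
  out$gdp_ratio      <- out$gdpPercap_to / out$gdpPercap_from

  # Equal-width bins over the observed range (1 = least progress).
  out$progress_bin <- cut(out$lifeExp_change, breaks = n_bins, labels = FALSE)

  # Flag countries that moved in the negative direction on either variable.
  out$declined <- out$lifeExp_change < 0 | out$gdp_ratio < 1
  out
}

# Usage, assuming the gapminder package is installed.
if (requireNamespace("gapminder", quietly = TRUE)) {
  changes <- progress_change(gapminder::gapminder)
  print(head(changes[order(changes$lifeExp_change), ]))
}
```

Raising `n_bins` to 10 or 20 gives the finer-grained levels mentioned above, and `lifeExp_change` itself could feed a continuous color gradient.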